Designing better data plots at the Burnet Institute

Dianne Cook
Monash University

Hello 👋🏼

🧩 Feel free to ask questions any time. 🤔


🎯 The objectives for today are:

  1. Build your knowledge of cognitive perception principles for good graphics
  2. Recognise elements of a current design that can be improved
  3. Develop skills to implement improved design

Outline

time topic
1:00 Data visualisation principles
1:30 Identifying poor elements in a plot
2:00 Fixing a plot design
2:30 BREAK

Cognitive perception principles

Hierarchy of mappings (1/2)



Cleveland and McGill (1984)



Illustrations made by Emi Tanaka

Hierarchy of mappings (2/2)

Based on the accuracy with which readers returned the numerical values.

  1. Position - common scale (BEST)
  2. Position - nonaligned scale
  3. Length, direction, angle
  4. Area
  5. Volume, curvature
  6. Shading, color (WORST)

Primary mapping used in common plots

  1. scatterplot, barchart
  2. side-by-side boxplot, stacked barchart
  3. piechart, rose plot, gauge plot, donut, wind direction map, starplot
  4. treemap, bubble chart, mosaicplot
  5. chernoff face
  6. choropleth map

Proximity

Place elements that you want to compare close to each other. If there are multiple comparisons to make, you need to decide which one is most important.

Code
# load libraries
library(tidyverse)
library(colorspace)
library(patchwork)
library(broom)
library(palmerpenguins)
library(ggbeeswarm)
library(vcd)
library(nullabor)
library(MASS)
library(colorspace)
library(conflicted)
conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::select)
conflicts_prefer(dplyr::slice)
conflicts_prefer(dplyr::rename)
conflicts_prefer(dplyr::mutate)
conflicts_prefer(dplyr::summarise)

# prepare data
tb <- read_csv("data/TB_notifications_2023-08-21.csv") |>
  filter(country == "Australia", year > 1996, year < 2013) |>
  select(year, contains("new_sp")) 
tb_tidy <- tb |>
  select(-new_sp, -new_sp_m04, -new_sp_m514, 
                  -new_sp_f04, -new_sp_f514) |> 
  pivot_longer(starts_with("new_sp"), 
    names_to = "sexage", 
    values_to = "count") |>
  mutate(sexage = str_remove(sexage, "new_sp_")) |>
  separate_wider_position(
    sexage,
    widths = c(sex = 1, age = 4),
    too_few = "align_start"
  ) |>
  filter(age != "u") |>
  mutate(age = fct_recode(age, "0-14" = "014",
                          "15-24" = "1524",
                          "15-24" = "1524",
                          "25-34" = "2534",
                          "35-44" = "3544",
                          "45-54" = "4554",
                          "55-64" = "5564",
                          "> 65" = "65"))
# plot 
tb_tidy |> 
  filter(!(age %in% c("0-14", "unknown"))) |>
  ggplot(aes(x=year, 
           y=count, 
           colour=sex)) + 
  geom_point() +
  geom_smooth(se=F) +
  facet_wrap(~age, ncol = 3) +
  scale_color_discrete_divergingx(palette="Geyser") +
  scale_x_continuous("year", 
    breaks = seq(1998, 2012, 2), 
    labels = c("98", "00", "02", "04", "06", "08", "10", "12")) +
  theme(axis.text = element_text(size=10)) +
  ggtitle("Arrangement A")

Code
tb_tidy |> 
  filter(!(age %in% c("0-14", "unknown"))) |>
  ggplot(aes(x = year, 
             y = count, 
             colour = age)) +
  geom_point() +
  geom_smooth(se=F) +
  facet_wrap(~sex, ncol = 2) +
  scale_color_discrete_divergingx(palette="Zissou 1") +
  scale_x_continuous("year", 
    breaks = seq(1998, 2012, 2), 
    labels = c("98", "00", "02", "04", "06", "08", "10", "12")) +
  theme(axis.text = element_text(size=10)) +
  ggtitle("Arrangement B")

Change blindness (1/2)

Making comparisons across plots requires the eye to jump from one focal point to another. It may result in not noticing differences.

Code
tb_tidy |>
  filter(age %in% c("45-54", "55-64"),
         sex == "f") |>
  ggplot(mapping=aes(x=year, 
                 y=count)) + 
  geom_point() +
  geom_smooth(aes(colour=age), se=F, method="lm") +
  facet_wrap(~age, ncol = 2) +
  scale_color_discrete_divergingx(palette="Geyser") +
  scale_x_continuous("year", 
    breaks = seq(1998, 2012, 4), 
    labels = c("98", "02", "06", "10")) +
  theme(legend.position="none",
        axis.text = element_text(size=10))


Code
tb_tidy |>
  filter(age %in% c("45-54", "55-64"),
         sex == "f") |>
  ggplot(mapping=aes(x=year, 
                 y=count)) + 
  geom_smooth(aes(colour=age), se=F, method="lm") +
  scale_color_discrete_divergingx(palette="Geyser") +
  scale_x_continuous("year", 
    breaks = seq(1998, 2012, 4), 
    labels = c("98", "02", "06", "10")) +
  theme(legend.position="none",
        axis.text = element_text(size=10))

Change blindness (2/2)


Help the reader remember what the pattern is in other panels by under-plotting all.

Code
ggplot(olives, aes(palmitoleic, palmitic, color = Area)) +
  geom_point() +
  scale_color_discrete_divergingx(palette="Zissou 1") 

Too many colours, too busy

Code
ggplot(olives, aes(palmitoleic, palmitic, color = Area)) +
  geom_point() +
  facet_wrap(~Area) +
  scale_color_discrete_divergingx(palette="Zissou 1") +
  guides(color = FALSE) 

Code
ggplot(olives, aes(palmitoleic, palmitic)) +
  geom_point(data = dplyr::select(olives, -Area), color = "gray80") +
  geom_point(aes(color = Area), size=2) +
  facet_wrap(~Area) +
  scale_color_discrete_divergingx(palette="Zissou 1") +
  guides(color = FALSE)

Pre-attentive

Can you find the odd one out?

Code
set.seed(209)
df <- data.frame(
  x=runif(100), 
  y=runif(100), 
  cl=sample(c(rep("A", 1), rep("B", 99))))
ggplot(data=df, aes(x, y, shape=cl)) + 
  geom_point(size=3, alpha=0.8) +
  theme(legend.position="None", aspect.ratio=1)

Is it easier now?

Code
set.seed(454)
df <- data.frame(
  x=runif(100), 
  y=runif(100), 
  cl=sample(c(rep("A", 1), rep("B", 99))))
ggplot(data=df, aes(x, y, colour=cl)) + 
  geom_point(size=3, alpha=0.8) +
  scale_color_discrete_divergingx(palette="Zissou 1") +
  theme(legend.position="None", aspect.ratio=1)

Colour palettes should match variable type (1/2)

There are three basic choices of palettes:

  • qualitative
  • sequential
  • diverging
  • (rainbow)
  • (palindrome) SKIPPED

Which one you choose depends on the

  • data values
  • and what to emphasize

Resources for exploring color:

Code
V1 = tibble(x = 1:7, 
            native = factor(c("quoll", "emu", "roo", 
            "bilby", "quokka", "dingo", "numbat")))
ggplot(V1, aes(x=x, y=1, fill=native)) +
  geom_tile() +
  geom_text(aes(x=x, y=1, label=native)) +
  ggtitle("qualitative") + 
  theme_minimal() +
  theme(legend.position = "none", 
        panel.background =
                    element_rect(fill = 'transparent', colour = NA),
        axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank(),
        #axis.line = element_line(colour = "white"),
        panel.grid.major = element_line(colour = "white"),
        panel.grid.minor = element_line(colour = "white"))

Code
V2 = tibble(x = 1:7, 
            fill = 1:7)
ggplot(V2, aes(x=x, y=1, fill=fill)) +
  geom_tile() +
  geom_text(aes(x=x, y=1, label=fill)) +
  ggtitle("sequential: emphasise high") + 
  scale_fill_continuous_sequential(palette = "PinkYl") +
  theme_minimal() +
  theme(legend.position = "none", 
        panel.background =
                    element_rect(fill = 'transparent', colour = NA),
        axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank(),
        #axis.line = element_line(colour = "white"),
        panel.grid.major = element_line(colour = "white"),
        panel.grid.minor = element_line(colour = "white"))

Code
V3 = tibble(x = 1:7, 
            fill = -3:3)
ggplot(V3, aes(x=x, y=1, fill=fill)) +
  geom_tile() +
  geom_text(aes(x=x, y=1, label=fill)) +
  ggtitle("diverging: emphasise high and low") + 
  scale_fill_continuous_divergingx(palette = "ArmyRose") +
  theme_minimal() +
  theme(legend.position = "none", 
        panel.background =
                    element_rect(fill = 'transparent', colour = NA),
        axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank(),
        #axis.line = element_line(colour = "white"),
        panel.grid.major = element_line(colour = "white"),
        panel.grid.minor = element_line(colour = "white"))

rainbow palettes (2/3)

Jet rainbow palette

Code
library(vital)
library(viridis)
am <- aus_mortality |> 
  filter(State == "Victoria", 
         Sex != "total", 
         Year < 1980, 
         Age < 90) 

ggplot(am, aes(x=Age, y=Mortality, colour=Year, group=Year)) + 
    geom_line() +
    facet_wrap(~Sex, ncol=1) +
    scale_color_gradientn(colours = rainbow(10)) +
    scale_y_log10() + 
    theme(aspect.ratio = 0.5)

Produces false detail, banding and color blindness ambiguity.

viridis palettes

Code
ggplot(am, aes(x=Age, y=Mortality, colour=Year, group=Year)) + 
    geom_line() +
    facet_wrap(~Sex, ncol=1) +
    scale_colour_gradientn(colors = viridis_pal(option = "turbo")(10)[10:1]) +
    scale_y_log10() + 
    theme(aspect.ratio = 0.5)

Have a uniform scale, match grey scale ladder. The turbo palette alleviates Jet rainbow palette problems.

rainbow palettes (3/3)

Jet rainbow palette

Code
ggplot(am, aes(x=Age, y=Mortality, colour=Year, group=Year)) + 
    geom_line() +
    facet_wrap(~Sex, ncol=1) +
    scale_color_gradientn(colours = deutan(rainbow(10))) +
    scale_y_log10() + 
    theme(aspect.ratio = 0.5)

Produces false detail, banding and ambiguity.

viridis palettes

Code
ggplot(am, aes(x=Age, y=Mortality, colour=Year, group=Year)) + 
    geom_line() +
    facet_wrap(~Sex, ncol=1) +
    scale_colour_gradientn(colors = deutan(viridis_pal(option = "turbo")(10)[10:1])) +
    scale_y_log10() + 
    theme(aspect.ratio = 0.5)

Colors still readable and following scale.

Transforming, e.g. colour scales

If the variable mapped to colour has a right-skewed distribution, consider transforming it using a log or a square root.

Code
ggplot(as_tibble(Titanic), aes(x=interaction(Sex, Age),
                               y=interaction(Class, Survived), 
                               fill=n)) +
  geom_tile() +
  xlab("Sex, Age") +
  ylab("Class, Survived") +
  scale_fill_continuous_sequential(palette = "Terrain")


This is the same data, where count has been transformed using square root.

Code
ggplot(as_tibble(Titanic), 
       aes(x=interaction(Sex, Age),
           y=interaction(Class, Survived), 
           fill=n)) +
  geom_tile() +
  xlab("Sex, Age") +
  ylab("Class, Survived") +
  scale_fill_continuous_sequential(
    palette = "Terrain", 
    trans="sqrt")

Order categorical variables by the statistic

Code
# remotes::install_github("kevinwang09/learningtower")
library(learningtower)
student_data_2018 <- load_student(2018)
student_means <- student_data_2018 |>
  group_by(country) |>
  summarise(math = mean(math, na.rm=TRUE),
            read = mean(read, na.rm=TRUE),
            science = mean(science, na.rm=TRUE))
save(student_data_2018,
  file="data/student_data_2018.rda")
save(student_means, file="data/student_means.rda")

# Compute differences and bootstrap
student2018_stats <- student_data_2018 %>%
  group_by(country) %>%
  summarise(mathgap=mean(math[gender=="male"],
                           na.rm=TRUE)-
                    mean(math[gender=="female"],
                           na.rm=TRUE),
            wmathgap=weighted.mean(
                    math[gender=="male"],
                      w=stu_wgt[gender=="male"],
                        na.rm=T)-
                     weighted.mean(
                    math[gender=="female"],
                      w=stu_wgt[gender=="female"], 
                        na.rm=T),
            readgap=mean(read[gender=="male"],
                           na.rm=TRUE)-
                    mean(read[gender=="female"],
                           na.rm=TRUE),
            wreadgap=weighted.mean(
                    read[gender=="male"],
                      w=stu_wgt[gender=="male"],
                        na.rm=T)-
                     weighted.mean(
                    read[gender=="female"],
                      w=stu_wgt[gender=="female"], 
                        na.rm=T))
save(student2018_stats, file="data/student2018_stats.rda")

library(boot)
cimathfn <- function(d, i) {
  x <- d[i,]
  if (nrow(x) == 0) {
    ci <- 0
  }
  else {
    ci <- weighted.mean(x$math[x$gender=="male"], 
       w=x$stu_wgt[x$gender=="male"], na.rm=T)-
     weighted.mean(x$math[x$gender=="female"],
        w=x$stu_wgt[x$gender=="female"], na.rm=T)
  }
  ci
}
cireadfn <- function(d, i) {
  x <- d[i,]
  if (nrow(x) == 0) {
    ci <- 0
  }
  else {
    ci <- weighted.mean(x$read[x$gender=="male"], 
       w=x$stu_wgt[x$gender=="male"], na.rm=T)-
     weighted.mean(x$read[x$gender=="female"],
        w=x$stu_wgt[x$gender=="female"], na.rm=T)
  }
  ci
}
bootmathfn <- function(d) {
  if (nrow(d) == 0) {
    ci <- c(0, 0)
  }
  else {
    r <- boot(d, statistic=cimathfn, R=200)
    l <- sort(r$t)[5]
    u <- sort(r$t)[195]
    ci <- c(l, u)
  }
  return(ci)
}
bootreadfn <- function(d) {
  if (nrow(d) == 0) {
    ci <- c(0, 0)
  }
  else {
    r <- boot(d, statistic=cireadfn, R=200)
    l <- sort(r$t)[5]
    u <- sort(r$t)[195]
    ci <- c(l, u)
  }
  return(ci)
}
math_results <- student_data_2018 %>%
  split(.$country) %>% 
  purrr::map(bootmathfn) 
cnt <- names(math_results)
math_results_tb <- tibble(country = rep(cnt, rep(2, length(cnt))), 
            ci = rep(c("l", "u"), length(cnt)),
            value=unlist(math_results)) 
math_results_tb <- math_results_tb |>
  pivot_wider(names_from = ci, values_from = value) |>
  filter(!(l == 0 & u == 0))
read_results <- student_data_2018 %>%
  split(.$country) %>% 
  purrr::map(bootreadfn) 
cnt <- names(read_results)
read_results_tb <- tibble(country = rep(cnt, rep(2, length(cnt))), 
            ci = rep(c("l", "u"), length(cnt)),
            value=unlist(read_results)) 
read_results_tb <- read_results_tb |>
  pivot_wider(names_from = ci, values_from = value) |>
  filter(!(l == 0 & u == 0))
save(math_results_tb,
  file="data/math_results_tb.rda")
save(read_results_tb,
  file="data/read_results_tb.rda")
❌ Default: alphabetical
Code
load("data/student_means.rda")
student_means_sub <- student_means |>
  filter(country %in% c("SGP", "KOR", "POL", "DEU", "NOR", "IRL", "GBR", "IDN", "AUS", "NZL", "USA", "TUR", "PHL", "MAR", "URY", "CHL", "COL", "CAN"))
ggplot(student_means_sub, aes(x=country, y=math)) + 
  geom_point(colour="#8ACE00", size=4) + 
  coord_flip() +
  xlab("") +
  theme(aspect.ratio = 2)

Full scale of number
Code
ggplot(student_means_sub, aes(x=country, y=math)) + 
  geom_point(colour="#8ACE00", size=4) + 
  coord_flip() +
  xlab("") +
  ylim(c(0, 1000)) +
  theme(aspect.ratio = 2)

✅ Order by statistic
Code
ggplot(student_means_sub, 
       aes(x=fct_reorder(country, math), 
           y=math)) + 
  geom_point(colour="#8ACE00", size=4) + 
  coord_flip() +
  xlab("") +
  ylim(c(0, 1000)) +
  theme(aspect.ratio = 2)

Read more about OECD PISA

Do the calculation for the reader

Code
data(anorexia, package="MASS")
ggplot(data=anorexia, 
  aes(x=Prewt, 
      y=Postwt, 
        colour=Treat)) + 
  coord_equal() +
  xlim(c(70, 110)) + 
  ylim(c(70, 110)) +
  xlab("Pre-treatment weight (lbs)") +  
  ylab("Post-treatment weight (lbs)") +
  geom_abline(intercept=0, slope=1,  
    colour="grey80", linewidth=1.25) + 
  geom_density2d() + 
  geom_point(size=3) +
  facet_grid(.~Treat) +
  theme(legend.position = "none")

  • Before and after treatment weight for anorexia patients
  • Three different treatments
  • Need to read the difference relative to a 45\(^o\) line
Code
ggplot(data=anorexia, 
  aes(x=Prewt, colour=Treat,
    y=(Postwt-Prewt)/Prewt*100)) + 
  xlab("Pre-treatment weight (lbs)") +  
  ylab("Percent increase in weight") +
  geom_hline(yintercept=0, linewidth=1.25, 
    colour="grey80") + 
  geom_point(size=3) +   
  facet_grid(.~Treat) +
  theme(aspect.ratio=1, legend.position = "none")

  • Compute the difference
  • Compare difference relative to before weight
  • Before weight is used as the baseline
  • EASIER to read the difference above and below a horizontal line

Aspect ratio (1/2)

❌ Wrong aspect ratio

Code
ggplot(data=anorexia, 
 aes(x=Prewt, y=Postwt, 
    colour=Treat)) + 
 xlim(c(70, 110)) + ylim(c(70, 110)) +
 xlab("Pre-treatment weight (lbs)") +  
 ylab("Post-treatment weight (lbs)") +
 geom_abline(intercept=0, slope=1,  
   colour="grey80", linewidth=1.25) + 
 geom_density2d() + 
 geom_point(size=3) +
 facet_wrap(~Treat, ncol=1) +
 theme(legend.position = "none",
       aspect.ratio = 0.5) #exaggerated


The default aspect ratio in most plots is rectangular.



If you want to compare two quantities, including assessing correlation, the aspect ratio should be square.



Two ways to achieve this with ggplot2:

  • theme(aspect.ratio=1) PREFERRED
  • coord_equal()

Aspect ratio (2/2)

Code
tb_tidy |> 
  filter(!(age %in% c("0-14", "unknown"))) |>
  ggplot(aes(x=year, 
           y=count, 
           colour=sex)) + 
  geom_point() +
  geom_smooth(se=F) +
  facet_wrap(~age, ncol = 1) +
  scale_color_discrete_divergingx(palette="Zissou 1") +
  scale_x_continuous("year", 
    breaks = seq(1998, 2012, 2), 
    labels = c("98", "00", "02", "04", "06", "08", "10", "12")) +
  theme(axis.text = element_text(size=10)) +
  ggtitle("Wrong aspect ratio")

Lines should be on average 45\(^o\).

  • To read and compare trend
  • To examine seasonality in time series

Mapping

A choropleth map is used to show a measured variable associated with a political or geographic region. Polygons for the region are filled with colour.

The purpose is to examine the spatial distribution of a variable.

Code
sa2 <- strayr::read_absmap("sa22011") |> 
  filter(!st_is_empty(geometry)) |> 
  filter(!state_name_2011 == "Other Territories") |> 
  filter(!sa2_name_2011 == "Lord Howe Island")
sa2 <- sa2 |> rmapshaper::ms_simplify(keep = 0.5, keep_shapes = TRUE) # Simplify the map!!!
SIR <- read_csv("data/SIR Downloadable Data.csv") |> 
  filter(SA2_name %in% sa2$sa2_name_2011) |> 
  dplyr::select(Cancer_name, SA2_name, Sex_name, p50) |> 
  filter(Cancer_name == "Thyroid", Sex_name == "Females")
ERP <- read_csv("data/ERP.csv") |>
  filter(REGIONTYPE == "SA2", Time == 2011, Region %in% SIR$SA2_name) |> 
  dplyr::select(Region, Value)
# Alternative maps
# Join with sa2 sf object
sa2thyroid_ERP <- SIR |> 
  left_join(sa2, ., by = c("sa2_name_2011" = "SA2_name")) |>
  left_join(., ERP |> 
              dplyr::select(Region, 
              Population = Value), by = c("sa2_name_2011"= "Region")) |> 
  filter(!st_is_empty(geometry))
sa2thyroid_ERP <- sa2thyroid_ERP |> 
  #filter(!is.na(Population)) |> 
  filter(!sa2_name_2011 == "Lord Howe Island") |> 
  mutate(SIR = map_chr(p50, aus_colours)) |> 
  st_as_sf() 
save(sa2, file="data/sa2.rda")
save(sa2thyroid_ERP, file="data/sa2thyroid_ERP.rda")
Code
# Plot the choropleth
load("data/sa2thyroid_ERP.rda")
aus_ggchoro <- ggplot(sa2thyroid_ERP) + 
  geom_sf(aes(fill = SIR), size = 0.1) + 
  scale_fill_identity() + invthm
aus_ggchoro

The problem with choropleth maps

The problem is that high density population areas may be very small geographically. They can disappear in a choropleth map, which means that we get a biased sense of the spatial distribution of a variable.

Cartograms

A cartogram transforms the geographic shape to match the value of a statistic or the population. Its a useful exploratory technique for examining the spatial distribution of a measured variable.



BUT they don’t work for Australia.

Code
# transform to NAD83 / UTM zone 16N
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet=TRUE)

nc <- nc |>
  mutate(lBIR79 = log(BIR79))
nc_utm <- st_transform(nc, 26916)

orig <- ggplot(nc) + 
  geom_sf(aes(fill = lBIR79)) +
  ggtitle("choropleth") +
  theme_map() +
  theme(legend.position = "none")

nc_utm_carto <- cartogram_cont(nc_utm, weight = "BIR74", itermax = 5)

carto <- ggplot(nc_utm_carto) + 
  geom_sf(aes(fill = lBIR79)) +
  ggtitle("cartogram") +
  theme_map() +
  theme(legend.position = "none")

nc_utm_dorl <- cartogram_dorling(nc_utm, weight = "BIR74")

dorl <- ggplot(nc_utm_dorl) + 
  geom_sf(aes(fill = lBIR79)) +
  ggtitle("dorling") +
  theme_map() +
  theme(legend.position = "none")

orig + carto + dorl + plot_layout(ncol=1)

Hexagon tile

A hexagon tile map represents every spatial polygon with an equal sized hexagon. In dense areas these will be tesselated, but separated hexagons are placed at centroids of the remote spatial regions.


Now the higher thyroid incidence in Perth suburbs, some Melbourne suburbs, and Sydney are more visible.

Code
if (!file.exists("data/aus_hexmap.rda")) {
  
## Create centroids set
centroids <- sa2 |> 
  create_centroids(., "sa2_name_2011")
## Create hexagon grid
grid <- create_grid(centroids = centroids,
                    hex_size = 0.2,
                    buffer_dist = 5)
## Allocate polygon centroids to hexagon grid points
aus_hexmap <- allocate(
  centroids = centroids,
  hex_grid = grid,
  sf_id = "sa2_name_2011",
  ## same column used in create_centroids
  hex_size = 0.2,
  ## same size used in create_grid
  hex_filter = 10,
  focal_points = capital_cities,
  width = 35,
  verbose = FALSE
)
save(aus_hexmap, 
     file = "data/aus_hexmap.rda")
}

load("data/aus_hexmap.rda")
## Prepare to plot
fort_hex <- fortify_hexagon(data = aus_hexmap,
                            sf_id = "sa2_name_2011",
                            hex_size = 0.2) |> 
            left_join(sa2thyroid_ERP |> select(sa2_name_2011, SIR, p50))
## Make a plot
aus_hexmap_plot <- ggplot() +
  geom_sf(data=sa2thyroid_ERP, fill=NA, colour="grey60", size=0.1) +
  geom_polygon(data = fort_hex, aes(x = long, y = lat, group = hex_id, fill = SIR)) +
  scale_fill_identity() +
  invthm 
aus_hexmap_plot  

Map thinning

The big change from working with maps in a GIS and maps for data analysis is the SIZE of the map data.

We are going to demonstrate how you need to change your approach, using the code in map.R.

The shapefiles for this example are downloaded from https://imcarto.webflow.io/gdb. It is the “Province” data for Indonesian Administrative boundaries.

Summary and more

Items that are primary elements of a plot:

  • colour
  • trend line (?)

Organising items:

  • place items to compare, close to each other
  • control the ordering, to make patterns easier to read
  • align axes for comparison across plots

Conventions:

  • time on horizontal
  • connecting dots
  • text horizontal
  • audience: academic, report, journalism

Calculations:

  • transformations to symmetry
  • do calculations for the reader
  • appropriate aspect ratio

Backgrounds:

  • Axes and text should sit in the background to be examined only when needing to interpret
  • Data elements should be pre-attentive, first items seen

Don’t repeat yourself: no units on each tick mark (e.g. %)

Data pre-processing:

  • to create mapping of variables
  • beware missing information

Identifying poor elements in a plot

Example 1

The distribution of responses is different if the declaration ceased.

  • position on a non-aligned scale
  • colour mapped to variable

polishing needed (cover later)

  • grey background
  • axis labels
  • diverging scale emphasizes both high and low

Example 2

Example 3

Example 4

Example 5

Example 6

  • blue category obscured by other two

Example 7

  • Four variables:
    • categorical:
    • temporal: sex, type of sex, period, weeks
  • ART initiation happens much faster after 2018, in all categories similarly.
  • Time is conventionally on horizontal axis.

Example 8

? studies are similar numerically, some have substantially more variation

  • Order: should it be by estimate? year? weight?
  • Is size of square matching weight? Is is matching area? or length of side. Could change width of line also, because size of square is not easily noticed.
  • Axis is marking min and max, unorthodox.

Example 9

data

  • the data has many thousands of genes, of which only a small number of most interesting are shown here.
  • variables in data:
    • treatment (CM1-4)
    • time (0/8/36)
    • replicate (3??)
    • expression
  • this data represents the top part of an iceberg, there’s more under the table than above it

observations

  • expression is different between two groups
  • four subsets illustrate the difference
  • most interesting genes are named
  • colour mapped to numerical value (expression)
  • colour used for different variables
  • colourblind-proof diverging colour scale applied to main variable

Fixing a plot design

Example 1

Code
# likert example
library(tibble)
library(ggplot2)
library(tidyr)
library(dplyr)
library(ggstats)
library(colorspace)

# generate some data
d <- data.frame(current = factor(rep(c("very unlikely", "unlikely", "neutral", "likely", "very likely"),
                        c(1,2,3,4,5)),
                        levels = c("very unlikely", "unlikely", "neutral", "likely", "very likely")),
            past = factor(rep(c("very unlikely", "unlikely", "neutral", "likely", "very likely"),
                        c(5,4,3,2,1)),
                        levels = c("very unlikely", "unlikely", "neutral", "likely", "very likely")))

# stacked bar
d_long <- d |>
  count(current, past) |>
  mutate(p = n/15) |>
  pivot_longer(cols = c(current, past), 
               names_to = "var", 
               values_to = "val") 
ggplot(d_long, aes(x=var, fill=val, y=p)) +
    geom_col(position="stack") +
    scale_fill_discrete_divergingx() +
    xlab("") +
    theme_minimal() 

Code
ggplot(d_long, aes(x=val, fill=val, y=p)) +
    geom_col() +
    facet_wrap(~var, ncol=1) +
    scale_fill_discrete_divergingx() +
    xlab("") +
    theme_bw() 

Code
ggplot(d_long, aes(x=val, y=p)) +
    geom_col() +
    facet_wrap(~var, ncol=1) +
    xlab("") +
    theme_bw() 

Code
gglikert(d, add_labels = FALSE, add_totals = FALSE) +
  scale_fill_discrete_divergingx() 

Example 9

  • data has gene, treatment, replicate
  • problem with heatmaps
  • see bigpint package: matrix and par coordinates

Heatmaps show all the data, with colour representing a numerical value.

Four chips (variables), 77 genes. How many clusters of genes do you see?

Scatterplot matrix shows that there are no actual clusters here.

  • genes are expressing differently between treatments
  • want to show signal (difference) relative to variance (variability within each treatment)

Assume:

  • each “chip” normalised correctly
  • select genes based on significance testing

Display:

  • parallel coordinate plot organised into pattern groups
  • individual profiles
  • classical interaction plot

Resources

End of session 1

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.